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5 ESTIMATING GENE NETWORKS USING INFERENTIAL METHODS AND 

BIOLOGICAL CONSTRAINTS 

Priority Claim 

This application claims priority to U.S. Provisional Patent Application No: 
60/529,146, entitled ESTIMATING GENE NETWORKS USING INFERENTIAL 
1 0 METHODS AND BIOLOGICAL CONTRAINTS, by Sascha Ott and Satoru Miyano, filed 
December 12, 2003, (Attorney Docket No. GENN-0101 lUSO) herein incorporated fully by 
reference. 

Field of the Invention 

15 This invention relates to estimating gene networks in biological systems using 

mathematical inferential methods using biological constraints. In particular, this invention 
relates to estimating gene networks using Bayesian estimation, with search fields limited by 
understanding of the biological constraints on the genetic system. 

20 BACKGROUND 

Inference of gene networks from gene expression measurements is a major challenge 
in Systems Biology. If gene networks can be inferred correctly, it can lead to a better 
understanding of cellular processes, and, therefore, have applications to drug discovery, 
disease studies, and other areas. Estimation of gene networks fi-om expression level 

25 measurements is one focus of Bioinformatics research. 

A frequently used approach to model gene networks are Bayesian networks. In 
Bayesian networks, the behavior of the network is described by a joint probability distribution 
for the expression levels of all genes in a relevant group of genes Scorn an organism. A joint 
probability distribution can be decomposed in conditional probability distributions using a 

30 directed acyclic graph, which we can call a network. A number of score functions have been 
proposed for the selection of the network, given gene expression measurement data. Work on 
new score functions exploiting previous knowledge is on-going. Networks are scored using 
score functions based on the likelihood of networks, given the data. Having selected a score 
function, the task of estimating a gene network is to find a network with minimal score. 

35 However, facing the NP-hard gene network estimation problem and a search space of 

super-exponential size, some have applied heuristic algorithms, for example, greedy 
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5 algorithms or simulated amiealing in order to estimate gene networks. Since heuristic 
algorithms do not provide any assurance of their accuracy, it remains unclear to which extent a 
network estimation will be biased by the computational method used. 

One can deduce an optimal gene network model with respect to some data for gene 
networks of about 30 genes or less. This result holds for any score function s : Gx2^ 

10 assigning a score to a gene g and a set of parents for g However, while optimal gene 
network models are the most likely models, there may still be very different models that 
have approximately the same likelihood with respect to some data, especially since gene 
expression measurements will in general provide only partial information about the gene 
network. Furthermore, even for a gene network of only 10 genes, there are about 

15 4. 1 7* 1 0** possible network models. Therefore, even optimal gene network models will in 
general not match the target gene network. 

SUMMARY 

In certain aspects of this invention, we present methods to estimate gene networks 

20 efficiently, though providing optimality of the estimated network in a restricted sense. First, an 
inferential model of possible gene networks is created. Second, a biologically relevant 
subspace of the search space is selected. Then, we find optimal solutions in the selected 
subspace by repeatedly applying an algorithm that computes small gene networks optimally. 
The selection of the subspace can be done by applying biological knowledge or, if no previous 

25 knowledge is available, for example, by using clustering methods. 

In certain embodiments, such an approach allows one to make use of biological 
knowledge by restricting the search space in a biologically slight but computationally effective 
way. The approach improves optimality with respect to the restricted search space and 
therefore provides a clear distinction of the part of the estimation that is done 

30 heuristically/empirically, and the part that is done optimally. 

In other embodiments, using Bayesian networks, the behavior of the gene network 
is modeled as a joint probability distribution for all genes. This allows a very general 
modeling of gene interactions. The joint probability distribution can be decomposed as a 
product of conditional probabilities, representing the regulation of a gene by other genes, 

35 This decomposition can be represented as a directed acyclic graph. Bayesian network 
models can be used to find biologically plausible gene networks. 
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5 Gene expression information can be produced using methods known in the art, 

including, for example, microarray analyses. As these methods are known in the art, we 
need not describe them further. However, it can be appreciated that the reliability of 
network relationships can depend upon the accuracy and/or reliability of the information 
obtained on gene expression. Therefore, in some embodiments, replicate assays, and 

10 controlled studies are desirable to increase the reliability of gene expression data. 

In other aspects, we provide the theoretical basis and an algorithm for the 
enumeration of optimal and suboptimal networks in the order of their likelihood. In further 
embodiments, we compared optimal gene network models to the available knowledge about 
existing gene networks. Instill further embodiments, we present an approach to extract the 

1 5 reliable part of optimal gene network models and apply this approach to the available data of 
B. subtilis and E. coli in order to compare gene network models of related species 

Our results show that the partial networks identified by our approach are in 
significantly better agreement with the knowledge than an optimal network model itself Since 
we base our approach on the best n gene network models, one can derive conclusions from 

20 these estimations more reliably than from methods that rely on heuristic methods alone. 

This invention also includes systems that carry out inferential analysis of gene 
networks, comprising at least an input device adapted to receive data on gene expression from 
a number of different genes, and a processor adapted to run a program for inferring gene 
network relationships. In additional embodiments, a system includes an output device adapted 

25 for displaying, printing, or sending results of gene network analysis to remote locations. 



BRIEF DESCRIPTION OF THE FIGURES 
This invention is described with reference to specific descriptions of embodiments of 
the invention and to the figures, in which: 
30 Figure 1 depicts gene network predicted by Algorithm 3 of Example 1 . 

Figure 2a depicts time course data obtained for B. subtilis. 
Figure 2b depicts disruptant data obtained for B. subtilis. 
Figure 3 depicts data obtained from E. coli. 

Figure 4 depicts gene network relationship motifs extracted for E. coli. 
35 Figure 5 depicts gene network relationship motifs extracted for B. subtilis. 
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5 DETAILED DESCRIPTION 

The elucidation of gene interactions in cells is an important focus of research in recent 
years (20). In certain embodiments, one can model gene networks using Bayesian networks 
(4,5,6,11, 12,14,16,19), In Bayesian networks, the behavior of the network is described by a 
joint probability distribution for the expression levels of all genes. The joint probability 

10 distribution must be decomposed in conditional probability distributions using a directed 
acyclic graph, or "network." A number of score fimctions have been proposed for the selection 
of the network (3,4,11) given gene expression measurement data. Work on new score 
fimctions exploiting previous knowledge is on-going (13,23). Having selected a score 
fimction, the task of estimating a gene network is to find a network with minimal score. 

15 However, facing the NP-hard gene network estimation problem (1) and a search 

space of super-exponential size (17), researchers have applied heuristic algorithms like greedy 
algorithms (1 1) or simulated annealing (7) in order to estimate gene networks. Since heuristic 
algorithms do not provide any assurance on their accuracy, it remains imclear to which extent 
the network estimation will be biased by the computational method. 

20 In certain aspects of this invention, we present methods to estimate gene networks 

efficiently, though providing optimality of a network in a restricted sense. First, an inferential 
model of possible gene networks is created. Second, a biologically relevant subspace of the 
search space is selected. Then, we find optimal solutions in the selected subspace by 
repeatedly applying an algorithm that computes small gene networks optimally (15). The 

25 selection of the subspace can be done by applying biological knowledge or, if no previous 
knowledge is available, for example, by using clustering methods. 

In certain embodiments, such an approach allows one to make use of biological 
knowledge by restricting the search space in a biologically slight but computationally effective 
way. The approach improves optimality with respect to the restricted search space and 

30 therefore provides a clear distinction of the part of the estimation that is done 
heuristically/empirically, and the part that is done optimally. 

EXAMPLES 

The following Examples are intended to illustrate aspects of the invention and are 
35 not intended to be limiting. Moreover, for each Example, the references are cited in 
brackets and refer to the list of references at the end of each Example. Other embodiments 
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5 of this invention can be produced by persons of skill in the art without undue 
experimentation. All of those embodiments are included within the scope of the invention. 



Example 1: Finding Optimal Gene Networks Using Biological Constraints 

The running time needed to find an optimal solution in the selected subspace is CHn% 
10 where n is the number of genes, therefore allowing estimation of arbitrarily large gene 
networks. We note that this approach can be applied using any score fiinction s of 
functionality s: Gx2^ —^IR, which is the case for all score functions within the Bayesian 
network ftamework, and for most others as well. 

We have implemented our method and applied it to yeast data and Bacillus subtilis 
1 5 data. Comparison of the estimated network for Bacillus subtilis to biological knowledge yields 
a significant agreement. 

Necessary notations are introduced in Section 2, the method is explained in detail in 
Section 3, and results of network estimations are discussed in Section 4. 



20 1.1. Preliminaries 

Throughout the example, we will use G to denote the set of genes, for which a gene 
network is to be predicted, and assume we are given a score function s : Gx2^ -^IR that 
assigns a score to a gene g ^ G and a set of parent genes A^G. Given a network A^, the score 
of N is defmed as 5'cor^(A0=def where P^(g) denotes the set of ^ 's parents in N. 

25 We introduce some notations fi*om (15). 

Definition 1.1: 

We define F:Gx2^ as F(g^)=di^i[mnB^s(g,B), for all ge G and Ac.G, 

F(g,A) is the best choice of parents forg when choice is restricted to^. Let us use ;rto denote 
30 apermutation ;r:{l,... \A\} — ►^ofasety4cG,andn^todenotethesetofallpermutationsof 
A.. 



35 



Definition 1.2: 

Let^cG. We define as lY^-^IRas: 
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5 



forall;ren^. 



Definition 1.3: 



We define 




(2) 



for all ^cG. 

In (1 5) it was shown that n) is an optimal network in tfie space of all networks for 
1 5 which M{A) is a topological sort and 1 is a permutation which is a topological sort for at least 
one optimal network on A. 

We state the following algorithm from (15) which we will use as a subroutine in this 



Step 1 : Compute = s{g,(l>) for all g 6 a 

Step 2: For all A^G.A^ (p and all ^ e G compute A) as 

rmn{s{gA)MT^sAF{gA'{a})} • 

Step 3: SetM{(/>)=<p. 

Step 4: For all A^G^ A^ <p , do the following two steps: 
Step 4a: Compute 

g* =arg min^A {F(gA-{g})+Q'-^'\M{A-{g}))y 
Step 4b: For all 1 < / < |, set M(^)(0=M(^-{g*})(0, and M{A) ( |^|) = g*. 

Step 5: return Q^(M{G)), 



Algorithm 1 . 1 computes function F in Step 1 and Step 2, and function M in Step 3 
and Step 4, making use of In Step 5, M was used to compute the score of optimal networks. 
This algorithm was applied to find optimal gene network models for gene networks of up to 
about 30 genes, but becomes less reliable for large gene networks. 



work. 



Algoritiini 1.1 



25 
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5 Theorem 1,1(15) Algorithm 7, 1 finds the optimal network using 0(n 2") 

dynamic programming steps 

1.2 Method 

We first proved a theorem, from which we then derive two algorithms for the 
1 0 estimation of large gene networks. 

1*2.1 Theoretical Basis 
The following derivation prepares a basis for our approach in this Example. 



IS Theorem 1.2 

Let m, c eIN. For each g e G» let Cg c G - {g} be a set of candidate parents. 
Assume that the following two conditions hold: 

1. \Cg\^mforallg eG. 

2. The strongly connected components of the graph induced by Cg, g eG, are not 
20 larger than c. 

Then optimal networks with respect to the selected candidate parents can be found in 
timeO{\G\), 



Proof 

25 Let L = (G,E) denote the graph on the set of genes G that is induced by g ^G. 

That means E contains exactly the edges (h,g) for which h e Gg, holds. Let . , . ,S„ ^ 
(7 denote the strongly connected components of £. Now let Z '={{Su . . ■ Ajj^") denote the 
graph on the strongly connected components with E. defined as: 
E' = def{{SU Sj)\\i ^g^Sf (gh) eE), 
30 Then, by the definition of strongly connected components there is no cycle in Z, 

W.l.o.g. let us assume that 5*1, ... A is a topological sort for L 'which means that there is no 
edge (Si, Sj)inL' withy < 

Algorithm 1.1, with slight modifications, can be apphed to compute optimal networks 
for G with respect to the selected candidate parents. In order to apply Algorithm 1.1 to a 
35 strongly connected component Si, we modify the algorithm in the following way: 

1 . hi the computation of F in Step 1 and Step 2, we only compute F(g,A) for all g 
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5 e Si and all A^Cg. 

2. We replace the term F(g, ^ ~ {g}) in Step 4a (see definition of Algorithm 1) by 

^fe(Q = 5^)U(Qn^))■ 

These two changes introduce the restrictions for candidate parents to the algorithm, 
10 and allow g e Si to have parents outside of Si. This does not affect the correctness of 
Algorithm 1.1, since the proof for Theorem 1.1 (15) can be done for the modified 
Algorithm 1.1 in the same way as for Algorithm 1 . We apply Algorithm 1 . 1 modified in 
this way to each strongly connected component Si in an arbitrary order yielding partial 

networks iV, = (G.Ei) for each / e 1 , . . . , n. Then we return the network iV = (G, U y^j 

15 By Theorem 1.1, each partial network Ni optimal. From the acyclicity of L ' it follows that 
N is acyclic. Thus, using the definition score(N) = Y.geGs(g,P^(g)X ^is an optimal network. 

Since we have 1 5/ 1 ^ c and | Q | ^ m, the computation time for one call of Algorithm 
1 becomes bound by some constant. Therefore, applying Algorithm 1.1 to all strongly 
connected components can be done in 0(1G|), which completes the proof 

20 We note that Algorithm 1 . 1 and the application of Algorithm 1 , 1 as a subroutine as 

defined in the proof can be implemented in a way that allows the enumeration of all optimal 
networks and also the enumeration of suboptimal networks in the order of their rank. The 
computation time will increase, depending on the number of networks to compute, but stays 
feasible. This can be valuable in order to assess the stability of estimated networks under 

25 minor changes of the score. 

1.2.2 Prediction of Large Gene Networks Without Previous Knowledge 
By Theorem 1.2, in order to allow an effective network prediction, one can select 
candidate parents for each gene such that the strongly connected components in the graph 
30 containing all candidate interactions can be bound by a constant. Since gene networks are 
assumed to consist of highly connected blocks, which are sparsely connected to each other 
(1 0, 1 8), this restriction seems to be satisfied in many research settings. Therefore, Theorem 
1 .2 provides the basis for a general approach of gene network estimation. 

In this work, we give two examples of algorithms designed to exploit Theorem 1 .2. The 
35 first one (Algorithm 1 .2) assumes no previous knowledge and detects highly cormected blocks 
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5 by clustering. The second algorithm (Algorithm 1 .3) applies biological knowledge to fulfill 
the restrictions of Theorem 1.2. 

Algorithm 1.2 

Step 1 : Cluster genes in G such that no cluster is larger than c genes. 
Step 2* clusters by decreasing size: C\, . . . ,C„. 

Step 3: For each i ^ {1, . . . , «} and for each g e Q, select up to m candidate parents 

fi-omC, U.-.U^^i- 
Step 4: Compute an optimal gene network model using Theorem 1 .2. 

1 0 Since the candidate parents for all genes g are chosen from the cluster g is contained 

in or previous clusters, no cycle in the graph induced by Cg, g eG can span two clusters. 
Therefore, the strongly connected components are not larger than c, because | C/| ^ c for all 
clusters Q, which shows the correctness of Algorithm 1 .2. 

Genes that belong to the same cluster are correlated and should therefore form a 
1 5 densely connected part of the gene network, while genes that belong to different clusters have 
a lower correlation and are therefore unlikely to be connected in the gene network. We use 
k-means clustering with the Pearson correlation as distance measure. 

The rational for sorting the clusters by decreasing size is as follows. The clusters at 
the beginning of the sorting have the strongest restriction for the selection of candidate 
20 parents, since there are few previous clusters. To maximize the number of genes, fi*om which 
candidate parents can be selected, big clusters should be put at the beginning. 

We note that the restrictions of the subspace imposed by Step 1 and Step 2 still allow 
any two genes to be connected, restricting only the direction of edges for some pairs of genes. 
There are many reasonable criteria for the selection of a candidate parent h for a gene 
25 g in Step 3. Examples are the correlation of genes or s(g, {h}). In our implementations, we 
made use of the latter criteria. 



1.2.3 Prediction of Large Gene Networks Using Previous Knowledge 

If there is previous knowledge about the gene network concerning the densely 



wo 2005/059707 



PCT/US2004/042027 



-10- 

5 connected components and the direction of the interaction of genes in different components, 
the following algorithm can be applied. 

Algorithm 1.3 

Step 1: Group genes in G in groups C, with | C, | ^ c and sort them according to biological 
knowledge: Ci . . . , C„ 

Step 2: For each / e { 1 , . . . ,«} and for each gene g e C,, select up to m candidate parents 
fromC, 

Step 3: Compute an optimal gene network model using Theorem 2. 

10 As for Algorithm 1 .2, the correctness of Algorithm 1 .3 follows directly fix)m the way 

candidate parents are chosen in Step 2. An example where not only the densely connected 
components, but also their order is known, are genes that are activated in specific phases of 
the cell cycle. In situations where the knowledge about densely connected components and/or 
their order is partial knowledge, a combination of Algorithm 1 .2 and Algorithm 1 .3 can be 

15 used. 



1.3 Results 

We have implemented Algorithm 1 .2 and Algorithm 1 .3, making use of an existing 
implementation of Algorithm 1.1(15), and publicly available clustering software (8). As score 
20 fimction s, we chose the BNRC score (11), since it can model non-linear gene interactions and 
can handle the gene expression data without discretization. 

1.3.1 Application of Algorithm 1.2 to Bacillus subtilis data 

We have applied Algorithm 1 .2 to Bacillus subtilis data (9). We selected the data for 
25 6 sigma factors and 79 operons known to be regulated by the chosen sigma factors (21). The 
expression ratios of operons were defined as the average of the expression ratios of their 
respective genes. Therefore, we have 79 known regulatory interactions in this network of 85 
genes resp, operons and will identify a significant part of these, if Algorithm 1 .2 has predictive 
power. We set parameter m, the maximal nimiber of selected candidate parents, to 15, and 
30 parameter c, the maximal size of clusters, to 25 . The actual size of clusters computed in Step 1 
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5 ranged from 8 to 24. For this computation, we used a Sun Fire 1 5K supercomputer with 96 
CPUs, 900 MHz each, for about one hour. 

In order to evaluate the biological significance of the estimated network, we 
computed the distance of each operon to each sigma factor in the network predicted by 
Algorithm 1.2, where we defined the distance as the minimal number of edges on an 

1 0 undirected path connecting the operon and the sigma factor. When the correct sigma factor 
was closer to an operon than all other sigma factors, we rated the regulatory relationship as 
detected. This allows one to compute the p-values for our estimation result, since for a 
meaningless predictor the probability of detecting regulatory interactions would be at most 
1/6. The actual probability is lower than «equation:GIW20146», because we count 

1 5 breaking ties as undetected. 



TABLE 1.1: 


Application of Algorithm 1.2 to Bacillus subtilis data. 


Sigma Factor 


Relations Found 


Known Relations 


p- value 


sigE 


8 


22 


0,021 


sigD 


5 


16 


0.113 


sigW 


12 


21 


20147 


sigH 


1 


11 


0.865 


sigX 


0 


5 


1.0 


sigF 


0 


4 


1.0 



Table 1 , 1 shows the result of this evaluation scheme. We observe that the estimated 
gene network is of significant agreement to the biological knowledge for two sigma factors, 



20 20148 and 20149. This shows that the estimation result contains valuable biological 
information. We note that a majority of the undetected regulatory relationships were breaking 
ties, and therefore the distance of operons to their correct sigma factors was still minimal, but 
one or more of the other sigma factors were as close. We observe that the especially strong 
significance for 20150 is consistent with a result from [CITE:dehoon03b] for differential 

25 equation networks. Therefore, it is likely that the data set is more informative for the region of 
the gene network that is around 20 1 5 1 than for other regions. 
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5 We note that there is little work on the problem to evaluate the quality of gene 

network estimations in a principled way. To our knowledge, there is only one other 
publication that gives a well-defined p-value to prove the significance of estimations 
[CITE:dehoon03b]. There are several problems that one encounters when doing a principled 
evaluation of gene network estimations. Examples are the incompleteness of knowledge about 

1 0 gene networks, and the uncertainty about what part of a gene network is working under which 
experimental conditions. Furthermore, if structural differences between the true gene network 
and the estimated network are found, it should be distinguished between substantially wrong 
differences and admissible differences like transitive edges. The above approach of analyzing 
shortest paths in estimated networks is one way to tackle some of these problems, but work on 

1 5 these problems should be further pursued. 

1.3.2 Application of Algorithm 1.3 to CeU Cycle Data 

Algorithm 1.3 is based on the application of previous knowledge in order to find a 
biologically meaningful subspace. Here, we apply Algorithm 1 .3 to RNA microarray data for 

20 Saccharomyces cerevisiae obtained firom time-course experiments using synchronised cell 
cultures (22). This data set can be expected to contain some information about the gene 
network that works during the cell cycle of yeast. 

It is known that the three gene pairs clnl/cln2, clb5/clb6, and clbl/clb2 are activated 
specifically during the Gl/S phase, the S phase, resp. the M phase of the cell cycle (2). By a 

25 clustering analysis based on this fact several genes were predicted to be predominantly 
activated during one of these phases (13). We selected a set of 43 genes, divided them into 
three groups, each group corresponding to one part of the cell cycle. We sorted these groups 
according to flie time order of the ceil cycle phases (Step 1 of Algorithm 1.3), set the 
maximum number of candidate parents (parameter) to 20, and applied Algorithm 1.3 to 

30 estimate a gene network. The computation was done using a single CPU with 1 .9 GHz for less 
than one hour. Figure 1 shows the estimation result. 

We observe a total number of 87 predicted gene interactions. Table 1.2 shows the 
number of directed interactions fi-om a first gene to a second gene partitioned by the groups 
(rows) and (colunms) the genes belong to. We see that 52 out of 87 interactions are within 

35 groups, well reflecting the expectation that most interactions occur between genes that are 
predominantly active during the same phase of the cell cycle. 35 interactions are estimated for 
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5 pairs of genes from different groups. The number of interactions between two groups is 1 3 for 
each of the temporal neighbors Gl/S and S/G2, respectively, S/G2 and M, but only 9 for the 
interactions from Gl/S phase to M phase. Therefore, though the gene network estimation is 
based on a decomposition of the set of genes in three groups, interactions between groups are 
well accounted for. 

10 

TABLE 1,2: Number of Predicted Interactions by Gene Groups 



9 


13 


18 


M 


13 


16 




S/G2 


18 






Gl/S 


Gl/S 


S/G2 


M 





We counted the number of interactions for each gene, which is the number of parents 
of 20158 plus the number of children of that gene. The genes with the highest number of 

1 5 predicted interactions in the estimated gene network (Figure 1 ) are Usted in Table 1 .3, which 
also gives the molecular function of these genes as annotated by the Saccharomyces Genome 
Database. We observed that all genes with at least 7 predicted interactions have regulatory 
activity on the transcriptional level or have a yet unknown ftmction. Among the three genes 
with 6 predicted interactions one gene {hcml) had regulatory function on the transcriptional 

20 level, while the two others are cell cycle dependent signaling proteins. We conclude that all 
genes with a higher number of predicted interactions have a known active regulatory role or 
are of unknown function. This showed that the estimated gene network contains at least some 
degree of biological information. 

The observation that the important genes cln2 and clb5 have a lower rank within the 

25 group of genes in Table 1 .3 may be explained by their role in signaling which is observable 
from mRNA measurements only in an indirect way. Interestingly, swiS is predicted to be 
regulated by fkh2 which is a known regulatory relation [CITE:zhu00], Therefore, the activity 
of the other genes on the transcriptional level is expected to be more easily observable. 
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5 TABLE 1.3: Molecular Functions of Genes With at Least 6 Predicted Interactions 



22cmGene 


Number of 
Interactions 


25cmMolecular Function 


yfrOHw 


8 


unknown 


yoxl 


7 


DNA binding, specific transcriptional 






repressor activity 


stbl 


7 


transcriptional activator activity 


htb2 


7 


DNA binding 




7 
/ 


TnU^TA VklnHtno 


swi5 


7 


transcriptional activator activity 


ydrl49c 


7 


unknown 


hcml 


6 


specific RNA polymerase II 






transcription factor activity 


cln2 


6 


cyclin-dependent protein kinase. 






intrinsic regulator activity 


clb5 


6 


cyclin-dependent protein kinase. 






intrinsic regulator activity 



Table 1 .4 shows on the contrary the genes with the lowest number of predicted 
interactions. We find several genes with metabolic or signaling functions, one gene of 

10 unknown function, one mRNA binding gene {stol), and one transcription factor {sM;i4). Since 
stol is known to be involved in nuclear splicing, a regulatory role is less likely. Therefore, 
only one out of the 11 genes with lowest number of predicted interactions has a known 
regulatory function on the transcriptional level, which is biologically plausible, since many 
signaling events will not be observable firom mRNA measurements. We conclude tiiat gene 

1 5 network estimations can give more insight to the function of genes than mere clustering 
results. 

TABL£ 1.4: Molecular Functions of Genes with at Most 2 Predicted Interactions 



20 
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22cinGene 


Number of 
Interactions 


25cmMolecular Function 


sco2 


1 


unknown 


stol 


1 


mRNA binding 


bbpl 


2 


structural constituent of cytoskeleton 


surl 


2 


transferase activity, transferring glycosyl groups 


clb3 


2 


cyclin-dependent protein kinase, intrinsic regulator 






activity 


clb6 


2 


cyclin-dependent protein kinase, intrinsic regulator 






activity 


adh2 


2 


alcohol dehydrogenase activity 


ino80 


2 


ATPase activity 


swi4 


2 


transcription factor activity 


smcl 


2 


AT DNA binding, ATPase activity. 






DNA secondary structure binding. 






double-stranded DNA binding 


smc3 


2 


ATPase activity 



5 



We herein provide a theoretical basis for a general approach to estimating large gene 
networks efficiently. Algorithms following this approach consist of two parts. The first part is 
a heuristic or empirical method to identify a biologically meaningfiil subspace of the search 
space. The second part is the optimal estimation of a gene network within the selected 
10 subspace. 

We have shown that this approach allows one to estimate meaningfiil gene networks, 
and to apply biological knowledge appropriately and effectively, while the computation time 
does not impose practical limitations for the number of genes. 

Since the approach does not depend on a certain score or a certain kind of gene 
1 5 expression data, it is generally applicable. Development of new scores incorporating previous 
knowledge such as sequence information (23) or structure information (1 3) is a useful way to 
fiirther increase the predictive power. Other ways to do the subspace selection heuristically 
include construction of subspaces around gene networks estimated by heuristics, or averaging 
the gene expression values for clusters and applying Algorithm 1 to find an optimal gene 
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5 network for the clusters, which can then be used to find an ordering for the clusters (instead of 
simply sorting by size as in Step 2 of Algorithm 1 .2). 
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Example 2: Finding Optimal Models for Small Gene Networks 
10 Introduction 

Inference of gene networks from gene expression measurements is a major challenge 
in Systems Biology. If gene networks can be inferred correctly, it can lead to a better 
understanding of cellular processes, and, therefore, have applications to drug discovery, 
disease studies, and other areas. 
1 5 Bayesian networks are a widely used approach to model gene networks (see refs. 3, 4, 

7, 9, 1 1 , 1 3 , 1 4, 1 7). In Bayesian networks, the behavior of the gene network is modeled as a 
joint probability distribution for all genes. This allows a very general modeling of gene 
interactions. The joint probability distribution can be decomposed as a product of conditional 
probabilities P{Xg\X^^ . . . ,Xn) > representing the regulation of a geneg by some genes gi,. . 

20 . This decomposition can be represented as a directed acyclic graph. The Bayesian 
network model has been shown to allow finding biologically plausible gene networks (see 
refs. 4, 9). 

However, the difficulty of learning Bayesian networks lies in its large search space. 
The search space for a gene network of n genes is the space of directed acyclic graphs with 
25 n vertices. A recursive formula as well as an asymptotic expression for the number of directed 
acyclic graphs with n vertices (c„ ) was derived by Robinson (see ref 15). We state the 
asymptotic expression here: 

^J-(«-l) 

cn = — — \r -0.57436; z ^ 1.4881 

For example, there are roughly 2.34 • 10^ possible networks with 20 genes, and about 
30 2.71 • 10*^^ possible solutions for a gene network with 30 genes. Even for a gene network of 
9 genes (search space size roughly 1.21 • lO'^), a brute force approach would take years of 
computation time even on a supercomputer. Moreover, it is known that the problem of finding 
an optimal network is NP-hard (see ref 1), even for the discrete scores BDe (see refs. 2, 3) 
and MDL (see ref 3). Therefore, researchers have so far used heuristic approaches like 
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5 simulated annealing (see ref. 8) or greedy algorithms (see ref. 9) to estimate Bayesian 
networks (see ref. 18). 

However, since the accuracy of heuristics is uncertain, it is difficult to base 
conclusions on heuristically estimated networks. In order to overcome this problem, we have 
analyzed the structure of the super-exponential search space and developed an algorithm that 
1 0 finds the optimal solution within the super-exponential search space in exponential time. This 
approach is feasible for gene networks of 20 or more genes, depending on the concrete 
probability distribution used. Furthermore, adding biologically justified assimiptions, the 
optimal network can be inferred for gene networks of up to 40 genes. 

Overcoming the uncertainties of heuristics opens up the possibility to compare 
1 5 statistical models with respect to their power to infer biologically accurate gene networks. 
Also, this method is a valuable tool for refining gene networks of known functional groups of 
genes. 

Methods are presented in Section 2.2. In Section 2.3 we present results of an 
application of this method, which show that it can estimate gene networks biologically 
20 accurate. 

2.2 Methods 

2.2.1 Preliminaries 

Throughout this section, we assume we are given a set of genes G and a network 
25 score fimction as used by several groups (see refs. 4, 9, 1 7), i.e. a fimction 5 : G x 2 — that 
assigns a score to a gene geG and a set of parent genes A^G. Given a network the score of 
N is defined as score(N) = Y^€d&{gJ^(g)\ where P^(g) denotes the set of g 's parents in N, 

Examples: 

30 1 . BDe score (see refs. 2, 3) 

The score is proportional to the posterior probability of the network, given the data. 

When the BDe score is used, the microarray data needs to be discretized. 

2. MDL score (see ref 3) The MDL score makes use of the minimal description length 

principle and also uses discretized data. 
35 3 . BNRC score (see ref 9) The BNRC score uses nonparametric regression to capture 

nonlinear gene interactions. Since the data does not need to be discretized, no information is 
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5 lost. The task of inferring a network is to find a set of parent genes for each gene, such that 
the resulting network is acycHc and the score of the network is minimal. 

Below, we introduce some notations needed to describe the algorithm. 



Definition 2.1: F 

1 0 We define F: G x 2^->/? as F(g4)=d^mB^s(gfi) for aU G and A^G, 

The meaning of F(g^) is, by the definition, the optimal choice of parents for gene g, 
when parents have to be selected firom the subset A. For every acyclic graph, there is an 
ordering of the vertices, such that all edges are oriented in the direction of the ordering. 
Conversely, when given a fixed order of G, we can think of the set of all graphs that comply 
1 5 with the given order, as we do in the next definition. 

An ordering ofaset>4cG can be described as a permutation n:{ 1,...,|/4|} —>A.ljQt 
us use to denote the set of all permutations of A, 



Definition 2.2: ir -linearity 

20 Let i4£:G and TrFI^. Let Nsl4 xAhea. network. We say Nisir -linear iff for all (gM) e 

NTf^(g)<Tf\h) holds. 

Now we use the above definitions and define function g^, which will allow us to 
compute the score of the best tt -linear network for a given ir, as we show below. 

25 Deflnition 2.3: 

Let A^G. We define : ->/? as 

e^CTc) =cief Z ngAheA\n-\h) <n-\g)}). 
geA 

for all in^. 



30 



If we can compute the best tt -linear network for a given permutation tt using 
functions F and Q, then what we need to do in order to find the optimal network is to find the 
optimal permutation tt, which yields the global minimum. Formally, we define fiinction Af for 
this step. 
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5 



Definition 2.4: 



M 



We define M:2^^^^q as 



10 



for all ^cG. 



Algorithm 2.1 



Using above notations, algorithm 2. 1 can be defined as follows. 



15 



Step 2: 



Step 1: 



Compute F(g,4>) = s{g,<p) for all ^ 6 G. 

For all yicG, 0 and all ^ e G compute F{g, A) as 



Step 3: SetAf(0)=<^. 

Step 4: For all A^G,A^ 0 , do the following two steps: 
Step 4a: Compute 

g* =3xg min^.A iF(gA-{g})-^(/'^'\M(A-{gm. 
Step 4b: For all l<i<\A |, set M(A)(iy=M(A-{g*})(i), and M(A)i\A\) = g*. 

Step 5: retum Q%M(G)), 

In the recursive formulas given in Step 2 and in Step 4, we want to compute the 
fiinction Fresp. M for a subset A^G of cardinality m~\A\^ and need function values of 
function Fresp. Mfor subsets of cardinality w-1. Therefore, we can apply dynamic 
20 programming in Step 2 as well as in Step 4 to compute functions F resp. M for subsets A of 
increasing cardinality. In the recursive formula in Step 4, first the last element g of the 
permutation M(A) is computed in Step 4a, and then M{A) is set in Step 4b. 



from the definition of F, Therefore, after execution of Step 1 and Step 2, the values of 
function F for all genes g and all subsets ^4 □ G are stored in the memory. Before proceeding to 
Step 3 and Step 4, we state a lemma on the meaning of function Q^, 



25 



Correctness and Time Complexity 

The correctness of the recursive formula in Step 2 of algorithm 2. 1 follows directly 
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Lemma 2A 

Let A^G and idl^. Let //c^ be a tt -linear network with minimal score. Then, 
Q*(Try=score(N*) holds. 



Proof. In a ^ -linear graph, a gene g can only have parents A, which are upstream in 
10 the order coded by tt, that is, Tf\h}<if^(g), Therefore, when selecting parents for gy we are 
restricted to B = { h^A\n'^\h) <n^\g) } , and is the optimal choice in this case. 

Since in a tt -linear graph, all edges comply with the order coded by tt, we can choose parents 
in this way for all genes independently, which proves lemma 2. 1 . 

Using Lemma 2. 1 , we prove that function M can be computed by the formula given 
15 in Step 4. 



Lemma 2.2 

ZeMi^.£e^^* = argmingeA(Frg,^- {g}) + Q^"^«*(M(A- {g}))). 
Define xen^ by •7i(0=M^-{g*})0% and 7c(M|) =g. 

20 Then, Tr=M(A). 



Proof. Let if' eH^. By the definition of M, we have to show Q^(n)Q*('K), Let * be an 
optimal TT^ -linear network, M* be an optimal tt -linear network. Then, by Lemma 2.1, 
Q*(n)Q!^(Tr) is equivalent to score N* score M*. Let us denote the last element of ?r as 
25 A = 7i' ( |/4 1) . We note that for any QP{M{B)) is the score of a global optimal network on 
B by above definitions. Therefore, we have: 

score {I^) = s(h^ P^' (h)) ^l^^^A-m s(gP^*(g)) 

^ s(h. (h)) +Q!'-<'^(M(A - {h})) 

^ F(hM-{h})^Q^~^'^(M(A-{h})) 

30 ^ min,,^ F^, A - {h}) + Q^-^'^(M(A - {h})) 

= F(g\ A - {g}) + Q"-^'*^ (M(A ~ {/})) 
score{N\ 
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5 which demonstrates lemma 2.2. 

Since Q can be directly computed using F, the algorithm can compute Q%M{G)) in 
Step 5. Finally, Q^(M(G)) is the score of an optimal Bayesian network by definition, which 
shows the correctness. 

If the information of the best parents is stored together with F(g^) for every gene 
10 g and every subset AcG, the optimal network can be constructed during the computation of 

Theorem 2.1 Optima! network without constraints 
The algorithm finds the optimal network using 0(n • 2") dynamic programming steps. 

1 5 Proof. The dynamic programming in Step 1 and Step 2 requires 0(n - 2") 

( « = |G| ) steps and in each step one score is computed. In the dynamic programming in Step 
3 and Step 4 0(T) steps are needed, where each steps involves looking up some previously 
stored scores. Note that the fiinction does not need to be actually computed in Step 4a, 
because Q^' can be stored together with M(A- {g} ) in previous steps. Therefore, the overall 

20 time complexity is 0(n 2"). 

In biological reality, while the number of children of a regulatory gene may be very 
high, the number of parents can be assumed to be limited. When we limit the number of 
parents, the number of score calculations reduces substantially, allowing the computation of 
larger networks. 

25 We state the following corollary, which is practically very meaningful (see Section 

2.3). 

Corollary 2.1 Let eIN be a constant. Optimal networks, in which no gene has 
more than m parents, can be found in 0{n -2") dynamic programming steps, [quote] If we do 
30 not want to limit the number of parents by a constant, but instead can select for each gene a 
fixed number of candidate parents, the complexity changes as follows. 

Corollary 2.2 Let eIN be a constant. For each ge Gy let (y^G be a set with (?.s7w. 

Optimal networks, in which each gene ^has parents only in Q, can be found in 
3 5 0(2") dynamic programming steps. 
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5 Proof. Since the parents of each gene are selected from a set of constant size, the 

complexity of the dynamic programming in Step 1 and Step 2 becomes constant. Therefore, 
the overall complexity becomes 0(2"). 

We note, that the two applications of dynamic programming in our algorithm can be 
implemented as a single application of dynamic programming, because when we compute 
1 0 function M for a set of size /n, we only need function values of function F for a set of size w-l . 
Therefore, only the function values for functions F and M for sets of size m-l and m need to 
be stored in the memory at the same time. This is practically meaningful to reduce the required 
amount of memory. 

We also note that the algorithm can be modified to also compute suboptimal 
1 5 solutions. Computing the second-best or the third-best network might be valuable in order to 
assess the stability of the infered networks imder marginal changes of the score. 

Results 

Algorithm 2.1 described above was implemented as a C-H- program. As scoring 
20 functions, existing implementations of the BNRC score, the BDe score and the MDL score are 
used- All three approaches (Theorem 2.1, Corollary 2.1 and 2.2) were implemented. 

We applied the program to a dataset of 173 microarrays, measuring the response of 
Saccharomyces cerevisiae to various stress conditions. (See ref 5). 

25 Application to Heat Shock Data 

From the dataset we selected 15 microarrays from 25*Cto 37"Cheat shock 
experiments and 5 microarrays from heat shock experiments from various temperatures to 
37''C Then we selected a set of 9 genes, which are involved or putatively involved in the heat 
shock response. Figure 2 shows the optimal network with respect to the BNRC score. 

30 We observe that the transcription frictor MCMl is estimated to regulate three other 

genes, while it is not regulated by one of the genes in this set, which is plausible. The second 
transcription factor in our set of genes, HSFJ, is estimated to regulate three other heat shock 
genes. It is also estimated to be regulated by a HSP70~pTotem ( SSAJ), which was reported 
before (see ref 16). Another chaperone among these genes, SSA3, also seems to play an 

35 active role in the heat shock response and interacts with SSAI and HSP104, coinciding with a 
report by Glover and Lindquist (see ref 98). 
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5 Table 2. 1 dq>icts results of such analysis. Overall, the result is biologically plausible 

and gives an indication for the active role of the chaperones SSAI and SSA3 during the heat 
shock response. We conclude that optimally inferred gene networks are meaningful and useful 
for the elucidation of gene regulation. 

1 0 Table 2.1 Genes Involved in Heat Shock Responses 



gene 


annotation 


HSFl 


heat shock transcription factor 


SSAI 


ER and mitochondrial translocation, cytosolic HSP70 


SSA3 


ER and mitochondrial translocation, cytosolic HSP70 


fflGl 


heat shock response, heat-induced protein 


HSP104 


heat shock response, themiotolerance heat shock protein 


MCMl 


transcription, multifunctional regulator 


HSP82 


protein folding, HSP90 homolog 


YR02 


unknown, putative heat shock protein 


HSP26 


diauxic shift, stress-induced protein 



Computational Possibilities and Limitations 

While even networks of small scale like the network infered in Section 2.3.1 cannot 
15 be inferred with a brute force approach (Equation 1 ), they can be optimally inferred by our 
program using a single Pentium CPU operating at 1 .9 GHz for about 10 minutes. In order to 
evaluate the practical possibilities of this approach, we selected 20 genes with known active 
role in gene regulation (see ref. 12) from the data set and estimated a network with optimal 
BNRC score using all 173 microarrays. The computation finished within about 50 hours 
20 using a Sun Fire 15K supercomputer with 96 CPUs, 900MHz each. As a result of this 
computational experiment, we conclude that our method is feasible for gene networks of 20 
genes, even if no constraints are made and a complex scoring scheme like the BNRC score 
is used. For the discrete scores BDe and MDL, which can be computed much faster, even 
networks of more than 20 genes can be infered optimally without constraints. 
25 When the nimiber of parents is limited to about 6 (Corollary 2.1) or, alternatively, sets 

of about 20 candidate parents are preselected (Corollary 2.2), even with the BNRC score gene 



wo 2005/059707 



PCT/US2004/042027 



-26- 

5 networks of more than 30 genes can be infered optimally. However, the method as it is now 
will not allow one to estimate networks of more than about 40 genes. 

While the theoretical time complexity of the approach given in Corollary 2.2 is below 
the time complexity of the approach given in Corollary 2. 1 , we argue that the latter might be 
practically more important. First, limiting the number of parents by a constant can be easily 

10 done and is biologically justified, while selecting a set of candidate parents for each gene 
requires a method of gene selection, which can potentially bias the computation result. 
Second, it has to be considered that each dynamic programming step in the computation of 
fimction F requires the computation of one score, while one dynamic programming step for 
function Monly requires looking up some previous results. When the number of parents is 

1 5 limited as in Corollary 2. 1 , the required number of score calculations becomes a polynomial, 
which makes this approach fester in practical applications, though the approach in Corollary 
2.2 is theoretically superior. 

We have presented a method that allows to infer gene networks of 20-40 genes 
optimally, depending on the probability distribution used and on whether additional 

20 assumptions are made or not. This makes it possible to compare different scoring schemes, to 
assess the best parameters for a given scoring scheme, and to evaluate the usefulness of given 
microarray data, since optimal solutions are obtained. Also, the method is especially useful in 
settings where researchers focus on a certain group of genes and want to exploit gene 
expression measurements concerning these genes to the fiill extent. 

25 hi contrast to heuristic approaches, if the results are unsatisfying or contradictory to 

biological knowledge, it can be concluded that the statistical model is incorrect or the data is 
insufficient. Even for a network of 20 genes, getting to know the best network fi*om the huge 
search space given is a large amount of information. 

We note that the method is not dependent on a certain scoring scheme or a certain 

30 kind of gene expression measurements. It can be applied in any setting, where a score as 
defined in Section 2.2 is given. For example, when sequence information (see ref 1 9), protein 
interaction data (see ref 10), or other knowledge is incorporated in the score function, this 
method can also be applied. 

hi order to find gene networks with more than 40 genes, two directions can be used. 

35 First, if a part of the set of subsets, in which the algorithm performs the actual search, can be 
pruned, the limit of feasibility might be increased. Second, compartmentalization of gene 
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5 networks (see ref. 1 8) might be used to decompose larger networks in smaller parts, and infer 
each partial network optimally. 
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30 Example 3 : Gene Networks That Are Better Than Optimal 
3.1 Introduction 

The estimation of gene networks from expression level measurements is one focus of 
Bioinformatics research in recent years (1, 5, 8, 13, 31). Learning the structure of gene 
networks from expression data may deepen our understanding of the responses of cells to their 
35 environment and our knowledge about construction principles of gene networks (10, 29), with 
possible appUcations in several disciplines. 

A widely used approach to model gene networks are Bayesian networks 3, 4, 9, 13, 
15, 26, 30), which model the expression level of a gene as random variables and gene 
networks as joint probability distributions. These distributions are decomposed using directed 
40 acyclic graphs, which we will call networks . Networks are scored using score ftinctions based 
on the likelihood of networks, given the data. 

A recent result shows that one can get to know optimal gene network models with 
respect to some data for gene networks of about 30 genes or less (25), This result holds for any 
score function s : Gx2^ assigning a score to a gene g and a set of parents for g. However, 
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5 while optimal gene network models are the most likely models, there may still be very 
different models that have approximately the same likelihood with respect to some data, 
especially since gene expression measurements will in general provide only partial 
information about the gene network. Furthermore, even for a gene network of only 1 0 genes, 
there are about 4.17*10*^ possible network models (17). Therefore, even optimal gene 

1 0 network models will in general not match the target gene network. 

Our endeavor to tackle this problem is threefold. First, we provide the theoretical 
basis and an algorithm for the enumeration of optimal and suboptimal networks in the order of 
their likelihood. Second, we rigorously compare optimal gene network models to the available 
knowledge about existing gene networks. Third, we present an 2q>proach to extract the reliable 

15 part of optimal gene network models and apply this approach to the available data of 
B. subtilis and E, coli in order to compare gene network models of related species. 

Our results show that the partial networks identified by our approach are in 
significantly better agreement with the knowledge than an optimal network model itself Since 
we base our approach on the best n gene network models, one can derive conclusions firom 

20 these estimations more reliably than fi-om methods that rely on heuristics. 

After presenting our theoretical results in Section 3.2, we state and discuss results of 
the comparison of estimations and knowledge in Section 3.3. Then, we present our gene 
network estimation approach in Section 3.4, which we evaluate and apply in Section 3.5. 
3.2 Gene Network Estimation and Enumeration 

25 Throughout the Example, we assimie we are given a set of genes G and a score 

function s : Gx2^ IR assigning a score to a gene g and a set of parents for g. Given a 
network N, the score of N is defined as score(N) =£gOGS (g^f^ig)), , where P^(g) denotes the 
set of g 's parents in N. In order to find the joint probability distribution that explains given 
data best, we need to find the directed acyclic graph Nttiat minimizes score{N). 

30 Since this optimization problem is NP-hard (2), and the search space has 

super-exponential size (27), researchers have firequently £^plied heuristic approaches like 
greedy algorithms, simulated annealing, or genetic algorithms (5,13,31). However, recently 
the following result was derived, which leads to an algorithm that can estimate gene networks 
of about 20 genes without constraints, and networks with 30 or htfle more genes, when 

35 making use of the fact that genes in real gene networks have a limited nimiber of parents. 
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5 Theorem 3.1 (25) 

Optimal networks can be found using 0(n • 2") dynamic programming steps. 

For larger gene networks, empiric or heuristic methods can be used to select a 
subspace of the search space. If this is done as described in (24), the algorithm of (25), can be 
used to find optimal solutions within the selected subspace, yielding a combined approach of 
1 0 deterministic and heuristic techniques. 

However, different networks may yield the same expression data with equal or 
approximately equal probabiUty. Therefore, one may not assume that given data contains the 
complete information about a gene network, but one has to bear in mind that the information 
that can be derived from a given dataset might be very well partial information. Thus, even if 
15 an optimal network is found, it will in general contain wrong edges, hi order to overcome this 
problem, we have extended the algorithm of (25) and proved the following result. 



Theorem 3.2 

Let m 0 IN. The best m networks can be found using 0{n -2") dynamic programming 

20 steps. 



3.2.1 Proof - Enumerating Optimal Gene Networks 
3.2.1.1 Algorithm 3.1 

25 We herein show how to extend the algorithm from (25) in order to allow the 

enumeration of optimal and suboptimal networks. We first define some functions, and then 
show how these functions can be computed for gene networks of considerable size. As in 
Section 3.2, we assume we are given a set of genes G and a score function s: Gx2^ ! IR, 
Given a network the score of N is defined as score(N) = 3gOGS (g,P^(g)X , where P^(g) 

30 where P^(g) denotes the set ofg 's parents in A^. Throughout this section, we assume networks 
with equal score to be sorted in some way, therefore allowing the notion "the n -th best 
network". 



35 



Definition 3.1: 

Letw6/A^. We define a; 2*^ a:/ 2^ inductively. First,forallg6 G and 
A^G, we define 



wo 2005/059707 



PCTAJS2004/042027 



-31- 

5 F"(g, v4, n) =defarg min s(g,B), 

(2) 

Then, denoting the set of all previous solutions {F^ig, A, p) \p<n} as J{n\ 
F"(g, A, n) =defarg min s(g,B) 

B^ (3) 
10 BeJ(n) 
for all \<n ^m. 

Definition 3.2: ST 

Let mO IN, We define ST : Gx2^xIN^-^IR as 

15 S'^fe A, n) ^defs(F"(g. A, n)) (4) 

for all g e G, A^ G, and « 6 IN^^n^ 

By the definition, F^{g^,n) is the n -th best choice of parents for a gene g when the 
20 parents have to be selected fi*om A^ and S^{gyA,n) is the score for this choice. When m is clear 
fi-om the context, we use F and S instead of and 5^, respectively. Note that and are 
partially defined fiinctions, since m may be larger than the number of subsets of ^. 

An ordering of a set AfG can be described as a permutation B:{1,...,|^|-^^. 
Let us use Y]^ to denote the set of all permutations of A. We need the following notation, 
25 which denotes the networks in which all edges comply with the direction given by a 
permutation 

Definition 3.3 : n -linearity 

Let A^G and Tte]^, Let NcA jc ^ be a network. We say A'^ is ;r -linear iff for all 
30 {g,h)EN 7f'{g)<7f\h) holds. 

The basic strategy of our algorithm is to divide the space of all directed acyclic graphs 
on a set A^G in subsets of ;r-linear networks, for all TteX^, decomposing the problem of 
finding optimal and suboptimal networks to the following two problems: 

1 . Find the subset of the search space that contains the (sub-)optimal 
35 network searched for. (Lemma 2) 

2. Find the (sub-)optimal network within the selected subspace. 
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5 (Lemma 1 and Lemma 2) 

We will use n to denote a subspace of the search space. In order to find optimal and 
suboptimal networks for a given permutation n , we need the following function 0^. For a 
given gene g, we denote the set of all genes that precede gin ;r as F(;r,g)=def 7r'(A)<7r^ 

(g)}. 

10 

DefinitiGii 3.4: ^ 

Let A^G, We define as rll^ x 7?v'^'-^2^ as 

^iny) =rf^{(/i,g)l/^eFfeK(;r,g), v,.,,(^),)} (5) 

15 for all ;r elf. 

In Definition 4, we have used a vector v6/A^'^ ' to determine the rank of the selection of 
parents for the particular genes. In Lemma 1 and Lemma 2, it will be shown that g^(;r ,v) is an 
optimal or suboptimal ;r-linear network, it's rank depending on v. Next, we define two 
functions A/" and D'" that specify subspaces, in which (sub-)optimal networks can be found, 

20 and the choice of a network firom the subspace, respectively. 

Dermition 3.5: NT^IT 

IgI 

Let/w ei?/. WedefineAT : 2^x/Ar^ -►U^ccE^ andzr : 2^x/Ar,„-^U--^ 

1 = 0 

inductively over their second parameter. Let A^G. First, we define 

25 

zr(^,7)=rf^(l,...,l)e/A^^I (6) 

and 

hT'iA,!) =d^arg min scoreiQf'in, iTiA, 1))) (7) 

30 

Let TielN^rn with n> 1 and let A'^ be a network on A with optimal score among networks not in 
{Q^{Nf"{A,p), DM(A,p)) \p<n} .Let 7r*e JY* be a permutation such that N is tt* -Hnear. We 
define 

Ar(^,n)=def (8) 



35 
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5 Let V* 6 /A^^' such that for every geA, the set of g 's parents, P^(g), equals 

i^fe K(^*,^),vVi(g)).' (9) 

We define: 



10 



(10) 



15 



As and STy A/" and are partial fiinctions. In Table 3.4, we summarize above notations. 
Using these notations, our algorithm can be defined as flows, given mO IN: 



Table 3.1 Functions Used to Define the Algorithm. 

The meanings follow directly from the definitions and Lemma 3.1. 



Function 


Functionality 


Meaning 






^{gA^n) is the n -th best choice of parents for g from 
A 






Q^(7r,v) is a ;r -linear network 


AT 




the n -th best network on A is Af"(A,n) -linear 


LT 


t - 0 


the n -th best network on A is Q*(Af'(A,n)J)'"iA,n)) 
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5 Using these notations. Algorithm 3.1 can be defined as follows. 



Step 1: 


Set I^(g,0A) = 0, sr(g,0A) = s(g,<p) for all geG. 


Step 2: 


For all g ^ G, all AqG, A and all n^m do the following two steps: 


Step 2a: 


Selects* from 




{B^^\B=AvB=I^(g,A-{h},p),h€A,p^m} - {I^(g,A,p)\p<n} 




such that sigyB*) is minimized. 


Step 2b: 


Set/^te^,«)=5*, S"(gA,n)=s{gX), 


Step 3: 


Set Ar(0,l)=<^ and ir{<p,l)=<p. 


Step 4: 


For all /4cG, 0, and all n^m do the following three steps: 


Step 4a: 


Choose a triple (g,p,q) ^A x IN^m x IN^m such that 




score{Q'-^'\Ar{A-{g},p)J)^iA-{g}4^^^^^ 




is minimized and (g,p,q) induces a network different from 




Q'{Ar(A,r)J)"'(A,r)) for r<«. 


Step 4b: 


Setiir(^,n)(0=Ar(^-{g},;7)(0 for i< |^|, and Ar(^,«) (|^|) =g,. 


Step 4c: 


Let V denote D'"(.4- {g} Set w G /TV"^' as wr=V/ for all / < |^ | and 
Set Zr(^,/i) = w. 


Step 5: 


Return Q%Ar{G,i)jr{G,i)) for all i^w. 
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5 

Algorithm 3. 1 computes the functions and in Step 1 and in Step 2 for all geG, 
Aq.G, and n^w. In order to select 5^ in Step 2a, only fimction values of for a s^iA of lower 
cardinality or lower n are needed. Therefore and can be computed applying dynamic 
programming. 

10 In Step 3 and Step 4, functions A/" and can be computed similarly using dynamic 

programming, since for the selection of a triple in Step 4a only function values of A/^ and 
rr for a set^ of lower cardinality or n of lower cardinality are needed in order to compute 
Af"{Ajn) and rf(A,n\ The meaning of the triple {gp,q) is as follows, g is a candidate for the 
last element in the permutation searched for. When g becomes the last element, the remaining 

1 5 permutation can be chosen from up to m previously computed permutations of A-{g} . The 
remaining permutation chosen is specified by p. Then, to form a network in the subspace 
defined by the resulting permutation, the q -th best selection of parents for is used, while for 
the other genes parents are selected as indicated by Z)'"(/4-{^},/?). 

Since all four functions *S^, AT", and Z)'" are partially defined, not for all n function 

20 values can be calculated for sets A of low cardinality. For simplicity, we did not explicitly 
mention this in the formulation of the algorithm. 

Algorithm 3 . 1 , as stated above, computes a fixed number (w) of solutions, regardless 
of the size of the set ^. In practical applications, the computation time can be optimized by 
calculating a fewer number of (sub-)optimal solutions for layers / with a large number of 

25 subsets of G with / elements. Then, when the number of subsets declines for high cardinality 
of Ay m can be increased. There is no guarantee that more than m (sub-)optimal solutions can 
be derived when only m optimal solutions are known for lower layers, but this worked in test 
computations. 

Since the number of networks that have to be stored during the computations can get 
30 very large, it is important to store networks efficiently in memory, while allowing fast 
decoding of stored networks. In our implementation, based on the formulation above, we store 
permutations n and make use of the restrictions for edges in ;r -linear networks, yielding a 
memory- and time-eflScient coding of networks. 
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5 In order to fulfill the requirement of Step 4a that the chosen network is different fix)m 

networks chosen previously, networks have to be compared during dynamic programming 
steps. However, since networks with different scores are different, only networks with equal 
score need to be compared. Therefore, the number of network comparisons can be minimized 
in practical applications. 

1 0 When one layer of i**", A/", or ET is computed, this can be done in an arbitrary order 

for the sets in the layer, and the computation depends only on values in the lower layer. 
Therefore, the above algorithm is well parallelizable. 



3.2.3 Correctness and Complexity 

15 Let us denote the n -th best network on a set A^G by A^*^,«. We first state two 

reformulated lemmata firom (25). 

Lemma 3.1 Let A^G and ttg Jl^. Let A^*^ x ^ be a ;r-Unear network with 
minimal score. Then, Q^i^Xl,. . . yl))=score(N^ holds. 

20 

Lemma 3.2 LetA^Gandm eIN. Let g* ^ argmin^^A (Sm(g, A = {g} A) ^*A-{gu)- 
Define tteIY^ by 7r{i)=M(A-{g*},\)(i), and 7r\A\=g\ Then,:;z-=Ar(^,l). 



Using these, we can prove the correctness of the algorithm defined above. 

25 

Theorem 3.2 Lei me IN 

The best m networks can be found using 0(n l") dynamic programming steps. 

Proof, The output of the algorithm, Q%hf(G,i)jT(G,i)\ i^m, are the best 
30 m networks on G by the definitions. We only need to prove that the recursive formulas given 
in the algorithm are correct. The equation given in Step 1 are correct by the definition of 
and .S^. When we select a subset of a stt A?G in Step 2, we have basically two choices: the 
whole set ^ or a true subset. In the former case, we can compute the score of the choice 
directly, in the latter, we can use previously computed values of and 5^, which gives the 
35 correctness of Step 2. 
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5 After the execution of Step 2, we have all values of and ST computed. Using these 

values, function Q can be computed directly. Therefore, we only need to compute flmctions 
A/" and in order to produce the output in Step 5. The equations in Step 3 are again correct 
by the definitions. We observe that with Lemma 3. 1 in combination with Definition 3.5, the 
following equation holds: 

10 

ira,n = QA(hr{A^n)fir(A,n)) 

From this equation and Lenraia 3.2 we see that the recursion in Step 4 is correct for 
«=1 . For n>\ , we compute the suboptimal permutation M^(A,n) and the suboptimal choice of 

1 5 parents LTiA^n) in the same way, restricting to a network not previously chosen. 

The time required for one dynamic programming step depends on w, but can be 
regarded as constant for m as chosen in the computations of this work. Furthermore, using a 
programming technique described in the Appendix, in practical computations it is sufficient to 
compute significantly less than m networks in most dynamic programming steps. 

20 Using this algorithm, we can enumerate optimal and suboptimal networks in the order 

of their likelihood. When we can find common components of such networks, the likelihood 
of the common components is the sum of the likelihood of the networks containing it. 
Therefore, common components of enumerated networks, called ^ene network motifs in this 
work, Therefore, our definition of a gene network motif is, at first, different from the notion 

25 used, for example, in (2 1 ), but might turn out to be closely related. Therefore, network motif 
analysis is expected to be more reliable than single network estimations. 

3.3. Evaluating the Reliability of Gene Network Estimations 

In order to validate that enumerated optimal and suboptimal gene network models 
30 contain valuable information about real gene networks, we implemented the algorithm and 
applied it to RNA microarray data. Considering the close relationship of transcription and 
translation in bacteria, we expect gene network estimations based on RNA data solely to be 
more suitable for bacteria than for eukaryotes, in A^iiich transcription and translation take place 
at different places and at a different time. Therefore, we chose Bacillus subtilis and 
35 Escherichia coli as targets, for which microarray data as well as knowledge about the gene 
networks is available. 
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5 

3.3.1 Data and Software 

For E. coli, we selected the datasets GDS95-GDS100 from the Gene Expression 
Omnibus (35). Changes in gene expression levels were elicited by perturbations of tryptophan 
metabolism, UV exposure, and novobiocin treatment (17, 18, 19). We then received data 

1 0 concerning known transcriptional regulation in E. coli from RegulonDB (28) for comparison 
with estimation results. 

For J5. subtilis, we used several datasets including 70 microarrays from time course 
experiments under various treatments, and 99 microarrays from gene disruptant experiments. 
The data is not yet publicly available, though it is was confirmed that biologically meaningfiil 

15 estimations can be done using these data (12). We then received a dataset of known 
transcriptional regulation from DBTBS (16, 20). 

Among the score functions {s : G x l^'-^IR) used by our implementation, which 
include the MDL score (4), the BDe score (3, 4), and the BNRC score (13), we selected the 
BNRC score for the computations in this work for the following reasons. First, the BNRC 

20 score uses the data without discretization, avoiding additional parameters and loss of 
information. Second, gene interactions are modeled using B-splines, allowing for a general 
modeling not restricted to linear relationships. Third, in preliminary computational 
experiments using all three score functions on the B. subtilis dataset, the estimations using the 
BNRC score were in significanfly better agreement with textbook knowledge (32). 

25 

33.2 Selection of Target Networks 

From the dataset of known regulatory relations for E. colU we selected all relations for 
which experimental evidence is provided. This yielded a set of 899 known relations. From the 
B, subtilis dataset, we selected 840 regulatory relations with evidence in the literature. Li order 

30 to select parts of these large networks, we applied a random procedure. Since we need to 
select genes in a way that there are some known regulatory relations among the selected genes, 
we select the first few genes randomly, and then select genes that are coimected to the 
previously selected genes iteratively, expanding the selected partial network in each step by 
one gene and at least one edge. In each iteration we selected a connected component of the 

35 partial network with equal probability, and then selected a gene with a known relation to at 
least one gene in the component randomly, if such a gene exists. Since we should avoid trivial 
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5 choices of target networks, we chose a gene not connected to the previously selected genes, 
when five connected genes have been selected in a row. 

The selection procedure yielded a known gene network N with non-trivial structure. 
We represented as a matrix. Each pair of genes with a known relationship is represented 
with a 1 entry in the matrix. Pairs of genes with no knowledge are represented by a 0 entry for 
10 most pairs, but a 0.5 entry for pairs for which one or more of the following four 
conditions hold. 

1 . gis regulated by h 

2. There is a gene / in the target network that regulates g and h 

3. There is a gene i in the target network that is regulated by g(h) and regulates h(g) 
15 4. Condition 2 or condition 3 hold for a gene i outside the target network 

Using these conditions, nearly correct predictions were distinguished from wrong 
predictions. If edge (g,h) is predicted, and (h^) is a known regulatory relation, then at least the 
fact that these two genes interact, was correctly predicted. In the same way, indirect regulatory 
relations, possibly via some gene not included in the target network itself is also not entirely 
20 wrong, if predicted. 

3.3.3 Evaluation Results 

We randomly selected 30 sets of 10 genes as described in Section 3,3.2, and applied 
the method to each target network. Then we counted how many times each relation was 
25 estimated in the best 500 network structures and examined the relationship between the 
frequencies of estimations and biological knowledge. Note that the resulting networks of our 
method are directed. However, as we mentioned in the previous section, estimated edges with 
wrong orientations still have information. So we considered both directed and undirected 
cases. 

30 It is difficult to know whether each estimated edge is correct or not, because there 

may be still undiscovered relations. Furthermore, known gene relations may not be expressed 
in the data. Therefore, we categorized the estimated edges into three groups based on the 
database information (1 , 0.5 and 0) and examined the difference of them. We presumed that 
good estimations can separate these groups well. 

3 5 Figures 2(a) and 2(b) show the results of B, subtilis time course and disniptant data, 

respectively. The x-axis shows the percentage of appearances of each edge in 500 networks. 



wo 2005/059707 



PCTAJS2004/042027 



^0- 

5 The y-axis shows the fraction of edges of the corresponding groups, that fall into the interval. 
Since we observed two peaks at both ends of the x-axis, we examined these two regions 
closer. 

Table 3.2: Frequencies of Edges Around 0 to 10% and 90 to 100% 





0-10% 




90-100% 






1 


0.5 


0 


1 


0.5 


0 


time course 


Undirected directed 


49.8 


48.1 


59.4 


29.3 


24.0 


15.6 






68.7 


66.0 


74.6 


7.72 


7.27 


4.47 


disruptant 


undirected 


57.9 


60.4 


76,3 


24.7 


18.7 


5.92 




directed 


73.6 


76.6 


87.3 


10.2 


7.07 


2.14 



10 Table 3.2 shows the percentages of frequencies. From Table 3.2 we observed the 

tendencies of three groups. Around 0%, group 0 showed more than 10% higher frequencies 
than the others. On the contrary, around 100%, the frequencies dropped to the lowest values. 
On the other hand, group 0.5 indicated closer values to group 1 . Comparing time course and 
disruptant data, the groups were well distinguished while using disruptant data. 

15 



Table 3.3: Frequencies of Edges Around 0 to 10% and 90 to 100% 





0-10% 




90-100% 






1 


0.5 


0 


1 


0.5 


0 


undirected 


60.0 


51.7 


63.3 


17.8 


24.8 


13.7 


directed 


72.0 


74.2 


79.1 


7.11 


7.90 


5.54 



Finally we analyzed E. coli microarray data and the result is shown in Figure 3 and 
Table 3,3 . Interestingly, the differences between the groups are somewhat small in this result. 
20 We presumed that the proposed method can evaluate not only the goodness of scoring 
function but also the accuracy of data. 
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5 3.4 Extraction of Gene Network Motifs 

While the evaluation in the previous section was based on counts of edges solely, we 
now consider using the complete information of enumerated networks. 
3.4.1 Motif Extraction Problem 

Theorem 3.2 in combination with the techniques described in this Example shows 

10 the possibility to enumerate optimal and suboptimal networks for gene networks of 
considerable size. A straightforward approach to exploit the information given by 
enumerated networks would be to count the number of occurrences of each edge in the 
networks, select only the edges, which have a count above some threshold, and compose 
a partial network from these edges. This approach was used in (26) in order to extract 

15 partial networks from networks enumerated by the bootstrap method. Under the 
assumption that the confidence levels of edges are independent from each other, it was 
shown that regions of significantly many edges with high confidence levels can be found 
in gene network estimations. However, this assumption does not account for the fact that 
the confidence levels of all edges connected to the same gene g depend on the expression 

20 measurements of g. 

However, even if each single edge has a count above some threshold, the partial 
network composed fi-om these edges does not necessarily have to be frequent in the 
networks at hand. Therefore, a partial network composed from likely edges may be 
unlikely itself This leads to the following problem, which we refer to as the gene 

25 network motif extraction problem: 

Given graphs Ni=(G,Ei),. . . ^m=(G,E^), and k e IR, find a set M c G G with 
\M\ = k maximizing the number of graphs Ni which include Af, i.e. Mc^,. 

30 The motif extraction problem is equivalent to the well-known problem of finding 

maximal frequent item sets from data mining. The problem of finding balanced complete 
bipartite subgraphs (6) can be reduced to both problems, which are therefore NP-hard. 
However, the problem can be solved for practical instances using the exhaustive search 
strategy we describe in the following. 



35 
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5 3.4.2 Motif Extraction Algorithm (Algorithm 3.2) 

In order to extract the partial information about gene networks contained in given 
gene expression measurements, we propose the following strategy, which uses three 
parameters n, c, K ^JN, 



Step 1: 


Enumerate the most likely gene network models Ni, 1 <,i^n. 
(Theorem 2, Appendix) 


Step 2: 


For every g, h eOG, coimt the occurences of the edge (g,/i) in the networks Ni. 


Step 3: 


Select all edges (g,h) with at least c occurences. 


Step 4: 


For all subsets M of the set of selected edges with \M\ = k, count the networks 
including all edges in M 


Step 5: 


Retum all motives M with at least c occurences. 



This algorithm makes use of the fact that every edge in a motif with at least 
c occurrences must have at least c occurrences itself Therefore, the number of candidate 
edges for composing motifs with more than c occurrences can be essentially reduced in 
practice. The running time of this exhaustive search strategy gets infeasible when the 
threshold c is set too low, but this did not impose practical limitations for our computations, 
since we are interested in gene network motifs with high confidence. 

3.5. Motif Extraction Results 

As in Section 3, we chose bacteria as a promising target for our estimations. We first 
20 evaluate the reliability of motif extractions, then we apply the method to a group of genes 
that is involved in tryptophan metabolism in B, subtilis as well as in coli. 



10 



15 



3.5.1 Finding Motifs in Bacterial Gene Networks 

In order to evaluate the predictive strength of the motif extraction algorithm, and 
25 to compare it to the predictive strength using the most likely gene network model alone, 
we applied it to the disruptant dataset for B. subtilis. We selected 25 target gene 
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5 networks Ni of 8 genes in a random way as described in Section 3.3, enumerated the 
most likely 100 networks, and extracted motifs with 2 or more edges, using 90 as the 
threshold (parameter c in our algorithm). This computation was performed using a single 
CPU with 1 .9 GHz for about 3 hours. We note that the enumeration can be applied to gene 
networks of up to about 30 genes, if a supercomputer is used. 

1 0 We then selected randomly one edge of the optimal network model, and one edge 

of the motif with the highest score among the motifs with the most edges. We then 
checked the correctness of both edges, using the DBTBS data, and computed the 
probability pi of randomly guessing a single edge from the known network correctly by 
dividing the nimiber of edges with a known regulatory relationship by the total number 

15 of possible edges in N^, According to the results of Section 3, we judged 1 -entries and 
0.5-entries as correct. We computed an upper bound for the probability of guessing at 
least k single edges correctly among n networks, P(njc% by using the following 
inequality 

20 P{n^k)^ S(-)y(l-pr , (1) 

where p denotes max The result of this computation is given in Table 3.3. We 

observed that the results for the motif extraction approach are more strongly significant than 
the results for optimal gene networks in the case of directed motifs, and equally significant in 

25 the case of indirected motifs. 

This is due to the fact that while in no case the optimal network model was the empty 
graph, for some target networks there was no motif above the threshold. Since our random 
selection of target networks is likely to choose some networks, that are not expressed in the 
data, we can not expect to find high scoring motifs for any set of genes, therefore predicting 

30 regulatory relations might be not appropriate for some target networks. We note that our 
results also hold for the predictions as a whole, since we selected arbitrary edges for the 
evaluation. 



35 
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5 Table 3.3: Evaluation Result for the Motif Extraction Algorithm 



Method 


Number of 
Edges 


Correct Edges 


p-value 


directed edge from directed motif 


22 


15 


0.00578 


undirected edge from directed motif 


22 


15 


0.00578 


undirected edge from undirected 
motif 


25 


16 


0.01083 


directed edge from optimal network 


25 


16 


0.01083 


undirected edge from optimal 
network 


25 


16 


0.01083 



3.5.2 Comparing Gene Networks of Related Species 

The methods described can help to unravel differences and similarities in gene 
1 0 regulatory networks of related species such as Gram-positive rod-shaped Bacillus subtilis and 
Gram-negative rod-shaped Escherichia coli, e.g. by applying the algorithms described to 
microarray data obtained from time course experiments of both E, coli (experiments for 
tryptophan metabolism) and B, subtilis and genes known to be involved in the well-studied 
tryptophan network. We extracted directed motifs based 100 enumerated optimal network 
15 models, using 50 as the threshold. A graph was derived from the output file showing the 
largest motif with the highest hits found in the data sets, genes are presented by nodes and 
each edge is labeled with its weight. 

The largest motif obtained the data set for E. coli was found 54 times and contains 1 3 
edges with weights ranging from 79 to 100 (Figure 4), the one for 5. subtilis data was found 
20 61 times and contains 15 edges, weights ranging from 77 to 100 (Figure 5). 

TrpA, trpB, trpC, trpD, trpE are linked by high weighted edges in both motifs, 
forming the core of this regulatory network, corresponding to the fact that they are positioned 
close to each other in the trp-operon in both species. Even the order of position in the 
trp-operon can partially be recognized in the graphs. But in B. subtilis, seven genes code for 
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5 the enzymes in the tip biosynthesis pathway, as opposed to five in E, colL The two extra 
genes, trpF and pabA are also contained in the derived motif. trpF is connected to trpB and 
trpD, corresponding to its approximate position on the trp-operon. PabA, also called trpG in 
B. subtilis^ is found closely connected to trpA and trpB although it is not located in the 
trp-operon but in the folate operon citehenner93,gollnick02. This close connection may 

1 0 indicate the close fimctional relation in the tryptophan biosynthesis pathway. In E. coli there 
also exists a pabA gene, but the results show no connection to any of the genes in the 
tryptophan network, consistent with the fact that it has not been found to be involved. 

On the other hand, mtrB is closely linked to trpE and trpF in the graph for B. subtilis. 
This can be explained by the fact that its gene product, tryptophan RNA-binding attenuation 

1 5 protein (TRAP) has been found to be among the key regulators of tryptophan biosynthesis in 
B, subtilis (24). Interestingly, mtr, coding for a tryptophan specific transport protein in E. coli 
is also associated with core genes of tryptophan biosynthesis like trpB and trpE in the graph. 
The structure of the network motifs suggests that the fiinction of mtr in E. coli might be 
similar to the one of mtrB in B, subtilis, 

20 Recently, a TRAP-inhibitory protein (anti-TRAP, AT) has been found and has been 

identified as the gene product of yczA (34). However, no association between yczA and the 
other genes examined can be found in the results. But this maybe due to too lowyczA-mRNA 
levels. Or the AT might be present in the cell throughout, without any need of new translation 
fi-om yczA-mRNA. This can well be since AT does not act alone but depends on tryptophan 

25 levels in the cell and only binds to Trp-activated TRAP (34). No homologue of yczA has been 
found so far in E, coli^ but BLASTP searches revealed that an amino acid sequence of AT 
shows high similarities to that of cystein-rich domains of the chaperone dnaJ (34). Therefore 
dnaJ had been included into this calculation, the results suggest an association with the 
tryptophan network in E. coli and even more in B, subtilis. The reason might be that 

30 chaperones can interact with all kinds of pathways. But it might also be caused by 
cross-hybridization with yczA-mRNA, Or dnaJ does actually play a role in the trp network. 
Because of the similarity on protein level, it might even fimction as a TRAP regulator. This 
would explain the strong link between dnaJ and mtrB in the graph for B. subtilis. 

TrpS, coding for the tryptophanyl-tRNA synthetase has been included into this 

35 analysis because it had been shown that tRNA^''' plays an important, yet different, role in the 
regulation of tryptophan biosynthesis in both bacteria. The structures of the respective network 
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5 motifs differ indeed. The graph for E, coli shows trpS being linked to mtr and wrbA. As 
mentioned above, mtr codes for a tryptophan specific transport protein (23) and may be 
involved in transcription termination which can be observed in abundance of tRNA^''' (34), 
stalling at the leader peptide, which is encoded by tipL, corresponding to the edge fi-om trpL 
to mtr in the graph. wrbA encodes a tryptophan repressor binding protein, so this edge 

10 corresponds to Tip activating the tryptophan repressor. However, there is no direct link 
between trpS and trpR, the gene coding for the tryptophan rqjressor. This can be explained by 
the fact that arrays measure mRNA ejqjression, and the tryptophan repressor protein acts 
through conformational change when Tip binds, not on the level of transcription. Yet the 
grqjh shows an association of trpR with trpD, which may indicate a generell production of 

1 5 tryptophan repressor protein together with tryptophan biosynthesis enzymes, though trpR is 
not located in the trp-operon. hi the B. subtilis graph, trpS expression is linked to pabA and 
dnaJ. If dnaJ functions as yczA, this is consistent with the finding that tRNA^*^ has been found 
to be associated with the formation of AT-inactivated TRAP (34). 

However, there are limitations. Microarray data are subject to noise and monitor on 

20 mRNA level only. There were only few E. coli data sets (18 arrays); for B. subtilis no 
experimental data fi-om experiments designed for analysis of tryptophan network had been at 
the inventor's disposal. But E. coli experiments had been designed for studying tryptophan 
metabolism, and B. subtilis data sets were obtained from bacteria growing on various media, 
i.e. under various nutritional conditions. So differences in the tryptophan levels in the media 

25 are to be expected, affecting the tryptophan gene regulatory network. Thus, given these data, 
the tryptophan network was the best target, although the set of genes chosen might not have 
been optimal, and suitable for evaluating the computational methods described. 

3.6. Conclusion 

3 0 We have provided the theoretical basis for the enumeration of optimal and suboptimal 

gene network models for gene networks of considerable size, presented results of a 
comprehensive comparison of optimal models to knowledge, introduced a method for 
extracting a reliable part of optimal network models, and applied this method. 

Extraction of the most reliable moti& fix)m optimal estimations based on 

35 state-of-the-art scores is valuable in itself and opens up the way to the analysis of common 
gene network motifs (network motif in the sense of), comparison of gene rietworks among 
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5 related species, and so on. 

The algorithmic methodology described in this work is not dependent on a certain 
scoring scheme or a certain kind of gene expression measurements. Therefore these 
techniques are generally applicable in all situations, where a score s with functionality s:Gx 
! IR is given, which is the case for all scores within the Bayesian network framework, but 
1 0 also for most other score functions. This is an important property for gene network estimation 
techniques, since work on new scores incorporating previous knowledge such as sequence 
information (33) or protein interaction data (14, 22) is on-going. 

For gene networks of size beyond the computational limits of our algorithm, 
techniques as in (24) can be applied in order to restrict to a part of the search space that is 
1 5 likely to harbor biologically meaningful networks and to find optimal network models within 
the selected subspace. We note that our enumeration algorithm can be applied in this case as 
well, such that our approach of finding gene network motifs is not limited to small gene 
networks. 

Rigorously assessing the accuracy of gene network estimations using available 
20 knowledge, as we conducted in Section 3 .3 , is a promising approach to develop standards for 
comparing the strength of gene network estimation methods. 

Each of the references cited herein are herein incorporated fully by reference. 
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